

Towards an ImageNet Moment for Speech-to-Text

#artificialintelligence

Speech-to-text (STT), also known as automatic speech recognition (ASR), has a long history and has made amazing progress over the past decade. Currently, it is often believed that only large corporations like Google, Facebook, or Baidu (or local state-backed monopolies for the Russian language) can provide deployable "in-the-wild" solutions. Following the success and democratization of computer vision (its so-called "ImageNet moment", i.e. the reduction in hardware requirements, time-to-market, and minimal dataset sizes needed to produce deployable products), it is logical to hope that other branches of Machine Learning (ML) will follow suit. The only questions are when it will happen and what conditions are necessary for it to happen. If those conditions are satisfied, new useful applications can be developed at reasonable cost, and democratization follows: one no longer has to rely on giant companies such as Google as the sole source of truth in the industry.


How long have we got before humans are replaced by artificial intelligence?

#artificialintelligence

My view, and that of the majority of my colleagues in AI, is that it'll be at least half a century before we see computers matching humans. Given that various breakthroughs are needed, and it's very hard to predict when breakthroughs will happen, it might even be a century or more. If that's the case, you don't need to lose too much sleep tonight. One reason for believing that machines will get to human-level or even superhuman-level intelligence quickly is the dangerously seductive idea of the technological singularity. This idea can be traced back more than fifty years, to John von Neumann, one of the fathers of computing, and to the mathematician and Bletchley Park cryptographer I. J. Good. More recently, it's an idea that has been popularised by the science-fiction author Vernor Vinge and the futurist Ray Kurzweil.


Facebook details wav2vec, an AI algorithm that uses raw audio to improve speech recognition

#artificialintelligence

Automatic speech recognition, or ASR, is a foundational part of not only assistants like Apple's Siri, but dictation software such as Nuance's Dragon and customer support platforms like Google's Contact Center AI. It's the thing that enables machines to parse utterances for key phrases and words and that allows them to distinguish people by their intonations and pitches. Perhaps it goes without saying that ASR is an intense area of study for Facebook, whose conversational tech is used to power Portal's speech recognition and which is broadening the use of AI to classify content on its platform. To this end, at the InterSpeech conference earlier this year the Menlo Park company detailed wav2vec, a novel machine learning algorithm that improves ASR accuracy by using raw, untranscribed audio as training data. Facebook claims it achieves state-of-the-art results on a popular benchmark while using two orders of magnitude less training data, and that it demonstrates a 22% error reduction over the leading character-based speech recognition system, Deep Speech 2. Wav2vec was made available earlier this year as an extension to the open source modeling toolkit fairseq, and Facebook says it plans to use wav2vec to provide better audio data representations for keyword spotting and acoustic event detection.
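The key idea behind learning from raw, untranscribed audio is contrastive prediction: push a context representation toward the true next audio sample and away from distractors drawn elsewhere in the signal. wav2vec's actual objective, sampling scheme, and networks live in the fairseq codebase; the sketch below is only a minimal illustration of that contrastive idea, with plain Python lists standing in for learned feature vectors (all names and values here are invented for illustration).

```python
import math


def dot(a, b):
    # Inner product of two feature vectors.
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def contrastive_loss(context, positive, negatives):
    """Binary contrastive objective: reward agreement between the
    context vector and the true future sample (positive), penalize
    agreement with distractor samples (negatives). Lower is better."""
    loss = -math.log(sigmoid(dot(context, positive)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(context, neg)))
    return loss


# Toy vectors: the loss is smaller when the context matches the true
# future sample than when it matches a distractor instead.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

No transcripts appear anywhere in this objective, which is what lets the method consume raw audio as training data.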



Deep Speech 3: Even more end-to-end speech recognition - Baidu Research

#artificialintelligence

Accurate speech recognition systems are vital to many businesses, whether they power a virtual assistant taking commands, analyze video reviews to understand user feedback, or improve customer service. However, today's world-class speech recognition systems can only be built using user data from third-party providers or by recruiting graduates from the world's top speech and language technology programs. At Baidu Research, we have been working on developing a speech recognition system that can be built, debugged, and improved by a team with little to no experience in speech recognition technology (but with a solid understanding of machine learning). We believe a highly simplified speech recognition pipeline should democratize speech recognition research, just like convolutional neural networks revolutionized computer vision. Along this endeavor we developed Deep Speech 1 as a proof of concept to show a simple model can be highly competitive with state-of-the-art models.


Propelling Deep Learning at Scale at Baidu AI Lab

#artificialintelligence

Researchers from Baidu's Silicon Valley AI Lab (SVAIL) have adapted a well-known HPC communication technique to boost the speed and scale of their neural network training, and now they are sharing their implementation with the larger deep learning community. The technique, a modified version of the OpenMPI algorithm "ring all-reduce," is being used at Baidu to parallelize the training of their speech recognition model, Deep Speech 2, across many GPU nodes. The two pieces of software Baidu is announcing today are the baidu-allreduce C library, as well as a patch for TensorFlow, which allows people who have already modeled in TensorFlow to compile this new version and use it for parallelizing across many devices. The code is available on GitHub. Baidu's SVAIL team developed the approach about two years ago for their internal deep learning framework, named Gene and Majel (in tribute to the famous Star Trek creator and the actress who voiced the onboard computer interfaces for the series).
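baidu-allreduce itself is a C library layered over MPI, but the ring all-reduce idea can be sketched without either: arrange N workers in a ring, scatter-reduce so each worker ends up owning one fully summed chunk of the gradient, then all-gather the chunks around the ring. Each worker only ever sends O(size/N) data per step, so no single node becomes a bottleneck. The pure-Python simulation below (workers are just lists, "sends" are buffered per step) is an illustration of the algorithm, not Baidu's implementation.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends up holding the
    element-wise sum of all workers' gradient vectors."""
    n = len(grads)                  # number of workers in the ring
    size = len(grads[0])
    # Split vector indices into n contiguous chunks.
    bounds = [(i * size // n, (i + 1) * size // n) for i in range(n)]
    grads = [list(g) for g in grads]  # each worker's local buffer

    # Phase 1: scatter-reduce. At step t, worker i sends chunk
    # (i - t) mod n to its right neighbour, which adds it in. After
    # n-1 steps, worker i owns the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            sends.append(((i + 1) % n, chunk, list(grads[i][lo:hi])))
        for dst, chunk, vals in sends:   # apply all "messages" at once
            lo, _ = bounds[chunk]
            for j, v in enumerate(vals):
                grads[dst][lo + j] += v

    # Phase 2: all-gather. Reduced chunks travel around the ring,
    # overwriting stale values, until every worker has all of them.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            sends.append(((i + 1) % n, chunk, list(grads[i][lo:hi])))
        for dst, chunk, vals in sends:
            lo, hi = bounds[chunk]
            grads[dst][lo:hi] = vals
    return grads


# Three simulated workers; afterwards each holds the element-wise sum.
workers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
reduced = ring_allreduce(workers)  # every worker: [111, 222, 333, 444]
```

Buffering the per-step sends before applying them mimics the simultaneous exchanges of a real MPI ring, where each transfer uses values from the end of the previous step.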


Quick Pro Quo: Software Writes Text 3x Faster Than Any Human Can

#artificialintelligence

Earlier this year, we watched a world-renowned Go mastermind get pummeled in a complex game by an artificial intelligence (AI). Now, humans are about to lose in yet another battle with the machines--and this time, it's over typing. Speech recognition software has improved to the point that it is faster and more accurate at producing text than human typists. That's according to researchers from Stanford University and the University of Washington, who ran a study on a new program developed by Chinese internet giant Baidu. Baidu's Deep Speech 2 is cloud-based voice recognition software based on a deep learning neural network.


Google, Baidu and the race for an edge in the global speech recognition market

#artificialintelligence

Daniel Faggella is founder of TechEmergence, a news and advice website for entrepreneurs and investors interested in the intersection of technology and the mind. Speech recognition technology has been around for more than half a century, though the early uses of speech recognition -- like voice dialing or desktop dictation -- certainly don't seem as sexy as today's burgeoning virtual agents or smart home devices. If you've been following the speech recognition technology market for any length of time, you know that a slew of significant players emerged on the scene about six years ago, including Google, Apple, Amazon and Microsoft (in a brief search, I counted 26 U.S.-based companies developing speech recognition technology). Since that time, the biggest tech trend setters in the world have been picking up speed and setting new benchmarks in a growing field, with Google recently providing open access to its new enterprise-level speech recognition API. While Google certainly seems to have the current edge in the market after substantial investments in machine learning systems over the past couple of years, the tech giant may yet have a potential Achilles' heel in owning an important segment of the global market -- lack of access to China.


Baidu's Deep-Learning System Rivals People at Speech Recognition

#artificialintelligence

China's leading Internet-search company, Baidu, has developed a voice system that can recognize English and Mandarin speech better than people, in some cases. The new system, called Deep Speech 2, is especially significant in how it relies entirely on machine learning for transcription. Whereas older voice-recognition systems include many handcrafted components to aid audio processing and transcription, the Baidu system learned to recognize words from scratch, simply by listening to thousands of hours of transcribed audio. The technology relies on a powerful technique known as deep learning, which involves training a very large multilayered virtual network of neurons to recognize patterns in vast quantities of data. The Baidu app for smartphones lets users search by voice, and also includes a voice-controlled personal assistant called Duer (see "Baidu's Duer Joins the Personal Assistant Party").
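A detail the excerpt glosses over: end-to-end systems in the Deep Speech line are trained with a CTC (connectionist temporal classification) objective, so the network emits a symbol distribution per audio frame, including a special "blank," rather than aligned words. Turning those per-frame outputs into text can be as simple as greedy decoding: take the most likely symbol per frame, merge consecutive repeats, drop blanks. The sketch below illustrates that collapse rule; the alphabet and probabilities are invented for illustration, and real systems add beam search and a language model.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Collapse per-frame CTC outputs into a transcript:
    argmax per frame -> merge consecutive repeats -> drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)


# Toy example: index 0 is the blank symbol "-".
alphabet = ["-", "c", "a", "t"]
frames = [
    [0.1, 0.8, 0.05, 0.05],  # c
    [0.1, 0.8, 0.05, 0.05],  # c (repeat, merged away)
    [0.9, 0.05, 0.03, 0.02], # blank
    [0.1, 0.05, 0.8, 0.05],  # a
    [0.1, 0.05, 0.8, 0.05],  # a (repeat, merged away)
    [0.1, 0.05, 0.05, 0.8],  # t
]
ctc_greedy_decode(frames, alphabet)  # -> "cat"
```

The blank symbol is what lets the network output genuine repeats: "ll" in "hello" would appear as l, blank, l, which the collapse rule keeps as two letters.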